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Abstract 

In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. 
CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software ar- 
chitecture developed by NVIDIA for computing on the GPU. We present an analysis 
and performance comparison between the GPU and CPU in single and double precision. 
Analyses with multiple GPUs and two different architectures (G200 and Fermi archi- 
tectures) are also presented. In order to obtain a high performance, the code must be 
optimized for the GPU architecture, i.e., an implementation that exploits the memory 
hierarchy of the CUDA programming model. 

We produce codes for the Monte Carlo generation of SU(2) lattice gauge configura- 
tions, for the mean plaquette, for the Polyakov Loop at finite T and for the Wilson loop. 
We also present results for the potential using many configurations (50 000) without 
smearing and almost 2 000 configurations with APE smearing. With two Fermi GPUs 
we have achieved an excellent performance of 200 x the speed over one CPU, in single 
precision, around 110 Gflops/s. We also find that, using the Fermi architecture, double 
precision computations for the static quark-antiquark potential are not much slower (less 
than 2x slower) than single precision computations. 
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1. Introduction 

Graphics Processing Units (GPUs) have become important in providing processing 
power for high performance computing applications. CUDA [1, 2] is a proprietary API 
and set of language extensions that works only on NVIDIA's GPUs and call a piece of 
code that runs on the GPU, a kernel. 

In 2007, NVIDIA released CUDA for GPU computing as a language extension to 
C. CUDA makes the GPU programming and computing development easier and more 
efficient than the earlier attempts, using OpenGL and associated shader languages, in 
which it was necessary to translate the computation to a graphics language [3] . 
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The most successful theories that describe elementary particle physics are the so called 
gauge theories. SU(2) is an interesting gauge group, either to simulate the electroweak 
theory, or to use as a simplified case of the SU(3) gauge group of the strong interaction. 
Gauge theories can be addressed by lattice field theory in a non-perturbative approxi- 
mation scheme based on the path integral formalism in which space-time is discretized. 
Quantities in the form of a path integral can be transformed to Euclidean space-time, 
which can be evaluated numerically using Monte Carlo simulations allowing us to use 
statistical mechanics methods. 

Generating SU(N) lattice configurations is a highly demanding task computationally 
and requires advanced computer architectures such as CPU clusters or GPUs. Compared 
with CPU clusters, GPUs are easier to access and maintain, as they can run on a local 
desktop computer. There are some groups [4, 5, 6] using GPUs to accelerate their lattice 
simulations, however this is for the full Lagrangian description. 

In this work, we make use of the new GPU technologies to accelerate the calculations 
in pure gauge lattice SU(2). Note that pure gauge lattice simulations do not include the 
full Lagrangian description, i.e., dynamical fermions. In particular, we are able to perform 
our computations integrally in the GPU, thus reaching a quite higher benchmarks when 
compared with previous computations partially done in the GPU and partially done in 
the CPU. 

This paper is divided in 6 sections. In section 2, we present a brief description on 
how to generate lattice SU(2) configurations and in section 3 we give an overview of 
the GPU hardware and the CUDA programming model. In section 4 we show how to 
generate lattice SU(2) configurations and calculate the static quark-antiquark potential 
in one GPU or in multiple GPUs. In section 5 we present the GPU performance over 
one CPU core, as well as results for the mean average plaquette and Polyakov loop for 
different ft and lattice sizes. We also present the static quark-antiquark potential with 
and without APE smearing. Finally, in section 6, we conclude. 



2. SU(2) Lattice Gauge Theory 

In this section, we describe the heat bath algorithm for generating SU(2) configura- 
tions [7, 8]. In SU(2), any group element U may be parametrized in the form, 

U = a 1 + ia ■ a , (1) 
where the a are the usual Pauli matrices and where 

a 2 = al + a 2 = 1 . (2) 
This condition defines the unitary hyper-sphere surface S 3 and 

Tr[/ = 2a , UU^ = U^U = t, det U = 1 . (3) 
The invariant Haar group measure is given by 

dU = -^<S (a 2 - 1) d i a , (4) 



where l/(27r 2 ) is a normalization factor. 
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Figure 1: Plaquette P^ u {s). 
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In order to update a particular link, we need only to consider the contribution to 
the action from the six plaquettes containing that link, the staple V . The plaquette 
is illustrated in Fig. 1. Notice that the pure gauge SU(2) lattice action is composed 
by the sum of all possible plaquettes, but all the other plaquettes factor out from the 
expectation value of a particular link. The distribution to be generated for every single 
link is given by 

"1 
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where (3 = 4/c/q, go is the coupling constant. We apply a useful property of SU(2) 
elements, that any sum of them is proportional to another SU(2) element U, 
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Thus, we need to generate ao G [ — 1, 1] with distribution, 

P(a ) cx \Jl-al exp(f3ka ) 



(6) 
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The components of a are generated randomly on the 3D unit sphere in a four dimensional 
space with exponential weighting along the ao direction. Once the ao and a are obtained 
in this way, the new link is updated, 



U 1 = UU~ 



(9) 



In order to accelerate the decorrelation of subsequent lattice configurations, we can 
employ the over-relaxation algorithm, 



isr iei ' 
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Figure 2: Polyakov loop. 



where S is the staple. 

S = Y.( U *>» U °+ 1 ># U l + to + ul_ . M U x ^U x -o + ^) , (11) 

and |S| = Vdct S. 

The simplest measurement that can be done in the lattice is the average plaquette. 
The average plaquette, (P), is given by, 

< P > = ^ E E^oo. (12) 

s G lattice 

where V is the lattice volume and P fiu (s) 1 see Fig. 1, is 

P MV (a) = l-±BeTr [U»{8)U v (8 + fi)Ut(8 + fi)Ut{sj\ . (13) 

Another interesting operator that can be calculated in the lattice is the expectation 
value of the Polyakov loop, (L) [9, 10], 

< L > = ^rE L ( x )' ( 14 ) 

JVff X 

where N a = N x x N y x N z . The product of link variables on the temporal direction L (x) 
is depicted in Fig. 2, some times this is called a Wilson line, 

, N t -i 

L(x) = -Tr JT C/ 4 (x,i) , (15) 
t=o 

where [/ 4 is the link along the temporal direction. Since we employ periodic boundary 
conditions in time direction, f7^(x, 0) = U^{~x., Nt), and in space direction, this is equiva- 
lent to a closed loop. The expectation value of the Polyakov loop is the order parameter 
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Figure 3: Wilson Loop operator on the lattice. 



for the deconfinement transition on an infinite lattice [9] . The order parameter measures 
the free energy, F q of a single static (infinite mass) quark at temperature T, 



(L) oc cxp ^-^f 

where T is connected to the lattice spacing a by 

1 

T = 



N t a 



(16) 



(17) 



When (L) = 0, the free energy of the quark increases arbitrarily with the volume, and 
this is interpreted as a signal of quark confinement. When (L) ^ 0, the free energy 
of the quark tend to a constant for large volume, and this is interpreted as a signal of 
deconfinement . 

We can also extend the square of size lxl, i.e., the plaquette, to construct an operator 
with a larger size, the Wilson loop. The Wilson loop, depicted in Fig. 3, is given by, 



W(R,T) = T±[U»(0,0)---U^((R-l)fi,0)U A (Rfi,0)---U4(Rfi,T-l) 
UU(R - T) ■ ■ ■ UUO, T)Ul(0, T -!)■■■ u\(0, 0) 



(18) 



where R is the spatial direction and T is the temporal direction. Note that the smallest 
non-trivial Wilson loop on the lattice is the plaquette. The mean value of the Wilson 
loop is utilized to compute the static quark-antiquark potential. 

In order to improve the signal to noise ratio of the Wilson loop, we can use the APE 
smearing. The APE smearing is a gauge equivariant prescription for averaging a link 
U^(x) with its nearest neighbours, 



(19) 



where Psu(2) is a projector back onto the SU(2) group, w — 0.2 and iterate this procedure 
25 times in the spatial direction. Empirically, it is seen that using a smeared operator 
helps to improve ground-state overlap dramatically. 
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Figure 4: Fermi Architecture. Fermi's 16 streaming multiprocessors are positioned 
around a common L2 cache. Each streaming multiprocessors is a vertical rectangular 
strip that contain an orange portion (scheduler and dispatch), a green portion (execu- 
tion units), and light blue portions (register file and LI cache) [11]. 



3. Cuda Programming Model 

CUD A [1, 2] is the hardware and software that enables NVIDIA CPUs to execute 
programs written with languages such as C, CH — h, Fortran, OpenCL and DirectComputc. 

CUDA programs call parallel kernels, each of which executes in parallel across a set of 
parallel threads. These threads are then organized, by the compiler or the programmer, 
in thread blocks and grids of thread blocks. 

The GPU instantiates a kernel program on a grid of parallel thread blocks. Within 
the thread blocks, an instance of the kernel will be executed by each thread, which 
has a thread ID within its thread block, program counter, registers, per-thread private 
memory, inputs, and output results. Thread blocks are sets of concurrently executing 
threads, cooperating among themselves by barrier synchronization and shared memory. 
Thread blocks also have block ID's within their grids. A grid is an array of thread blocks. 
This array executes the same kernel, reads inputs from global memory, writes results to 
global memory, and synchronizes between dependent kernel calls. 

In the CUDA parallel programming model, each thread has a per-thread private 
memory space used for register spills, function calls, and C automatic array variables. 
Each thread block has a per-Block shared memory space used for inter-thread commu- 
nication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks 
share results in Global Memory space after kernel-wide global synchronization. 

In the hardware execution view, CUDA's hierarchy of threads maps to a hierarchy of 
processors on the GPU; a GPU executes one or more kernel grids; a streaming multipro- 
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Table 1: NVIDIA's architecture specifications (SM means Streaming Multiprocessor) 
Source [11]. 

cessor (SM) executes one or more thread blocks; and CUD A cores and other execution 
units in the SM execute threads. The SM executes threads in groups of 32 threads called 
a warp. While programmers can generally ignore warp execution for functional correct- 
ness and think of programming one thread, they can greatly improve performance by 
having threads in a warp executing the same code path and accessing memory in nearby 
addresses. 

The first Fermi based GPU implemented with 3.0 billion transistors, features up to 
512 CUDA cores, organized in 16 SMs of 32 cores each. A CUDA core executes a floating 
point or integer instruction per clock for a thread. In Fig. 4 and Table 1 we present 
the details of the Fermi architecture. The GPU has six 64-bit memory partitions, for a 
384-bit memory interface, and supports up to a total of 6 GB of GDDR5 DRAM memory. 
The connection of the GPU to the CPU is made by a host interface via PCI-Express. 
GigaThread global scheduler distributes thread blocks to SM thread schedulers. 

The Fermi architecture [11] represents the most important improvement in GPU 
architecture since the original G80, an early vision on unified graphics and computing 
parallel processor. GT200 extended its performance and functionality. Table 1 shows the 
details between the different architectures (G80, GT200 and Fermi architectures). With 
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Fermi, NVIDIA used the knowledge from the two prior processors and all the applications 
that were written for them, and employed a completely new approach to design and to 
create the world's first computational GPU. 

The Fermi team designed a processor, Fig. 4, that highly increases not only raw 
compute horsepower, but also, at the same time, the programmability and computational 
efficiency using architectural innovations. They made improvements in double precision 
performance, a true cache hierarchy since some algorithms cannot take advantage of the 
Shared memory resources (NVIDIA Parallel DataCache hierarchy with configurable LI 
and unified L2 caches), have more shared memory, faster context switching and faster 
atomic operations. 

In all GPUs architectures, it is necessary to take into account the following perfor- 
mance considerations: memory coalescing, shared memory bank conflicts, control-flow 
divergence, occupancy and kernel launch overheads. 

4. Mapping Lattice SU(2) to GPU 

In this section, we discuss the parallelization scheme for generating pure gauge SU(2) 
lattice configurations. 

A CUDA application works by spawning a very large number of threads on the 
GPU which are executed in parallel. The threads are grouped in thread blocks and 
the entire collection of blocks is called a grid. CUDA provides primitives that allow 
the synchronization within a thread block. However, it is not possible to synchronize 
threads within different thread blocks. In order to avoid the penalty for high latency, 
we must ensure a high multiprocessor occupancy, i.e., each multiprocessor should have 
many threads simultaneously loaded and waiting for execution. In this work, we assign 
one thread to each lattice site and in all runs we maintain the thread block size fixed. 
Since CUDA only supports thread blocks up to 3D and grids up to 2D, and the lattice 
needs four indexes, we use 3D thread blocks, one for f, one for z and one for both x and 
y. We then reconstruct the other index inside the kernel. 

We place most of the constants needed by the GPU, like the number of points in the 
lattice, in the constant memory using cudaMemcpyToSymbol, as in the following example 

cudaMemcpyToSymbol ( "Nx" , &Nx, sizeof(int) ); 

The code to obtain the four indices of the 4D hypercube, when using a single GPU, inside 
the kernel is 

int blockldxz = f loat2int_rd(blockIdx. y * invblocky) ; 

int blockldxy = blockldx.y - umul24 (blockldxz, blocks_y) ; 

int ij = mul24(blockIdx.x, blockDim.x) + threadldx.x; 

//Index's of 4D hyper-cube 

int i = mod(ij , Nx) ; 

int j = float2int_rd(ij / Nx) ; 

int k = mul24 (blockldxy , blockDim.y) + threadldx.y; 

int t = mul24 (blockldxz, blockDim.z) + threadldx.z; 



and outside the kernel we define, 
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threads_x = mineq(Nx * Ny, 16) ; 
threads_y = mineq(Nz, 4); 
threads_z = mineq(Nt, 4); 

blocks_x = (Nx * Ny + threads_x - 1) / threads_x; 
blocks_y = (Nz + threads_y - 1) / threads_y; 
blocks_z = (Nt + threads_z - 1) / threads_z; 

block = make_uint3(threads_x, threads_y, threads_z) ; 
grid = make_uint3(blocks_x, blocks_y * blocks_z, 1); 
invblocky = 1 . Of / (T)blocks_y; 

where mineqO is a function that returns the minimum value. A kernel is then defined, 
for example, as 

Cold_Start<T4>«< grid, block »> (lattice_d) ; 

Note that in the Polyakov loop kernel we only need three indexes and we can use the 
3D thread blocks, i.e., in the kernel, we use 

int blockldxz = f loat2int_rd(blockIdx. y * invblocky_3D) ; 

int blockldxy = blockldx.y - umul24 (blockldxz, blocky_3D) ; 

int i = mul24(blockIdx.x,blockDim.x) + threadldx.x; 

int j = mul24 (blockldxy ,blockDim.y) + threadldx.y; 

int k = mul24 (blockldxz ,blockDim.z) + threadldx.z; 

and each thread make the temporal link multiplication from t = to t = N t — 1 and the 
number of thread blocks and the number of block is defined as, 

threads_x = mineq(Nx, 8); 
threads_y = mineq(Ny, 8) ; 
threads_z = mineq(Nz, 8); 

blocks_x = (Nx + threads_x - 1) / threads_x; 

blocks_y = (Ny + threads_y - 1) / threads_y; 

blocks_z = (Nz + threads_z - 1) / threads_z; 

block_3D = make_uint3(threads_x, threads_y, threads_z) ; 
grid_3D = make_uint3(blocks_x, blocks_y * blocks_z, 1); 
invblocky_3D = 1 . Of /(T)blocks_y ; 
blocky_3D = blocks_y; 

Since memory transfers between CPU and GPU are very slow comparing with other 
GPU memory and in order to maximize the GPU performance, we should only use this 
feature when it is extremely necessary. Hence, we only use CPU/GPU memory transfers 
in three cases: in the initial array of seeds for the random number generator in the GPU, 
in the end of the kernel to perform the sum over all lattice sites (copy the final result to 
CPU memory) and when using multi-GPUs (exchange the border cells between GPUs). 

The kernels developed for this work are: 
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• Random number generator, RNG; 



• Lattice initialization: 

— Cold start, U = 1; 

— Hot start, random SU(2) matrix; 

— Read a configuration from input file. 

• Heat bath algorithm: 

• Over-relaxation method; 

• Plaquette (for each site); 

• Polyakov Loop (for each site) ; 

• Wilson Loop (for each site); 

• APE Smearing; 

• Parallel reduction. Sum over all sites of an array. This kernel performs a sum over 
all sites after calculation of the plaquette, Polyakov loop and Wilson loop. 

For the generation of the random numbers needed in the hot start lattice initialization 
and in the heat bath algorithm, we use a linear congruential random number generator 
(LCRNG) [12], given by 

Xi+i.j — {axi.j + b) mod m , (20) 

and 

x o,j+i = ( cx o,j) mod m , (21) 

with a = 1664525, b = 1013904223, c = 16807, m = 2147483647 and x nfi = 1- We 
generate the first random numbers xq j in the CPU and then copy the array to the GPU. 
Therefore, we can generate a different random number in each GPU thread. 

The LCRNG is used only in the performance tests since this type of random number 
generator is not suitable for production running. However, in the results we use the 
random number generator included with the NVIDIA Toolkit 3.2 RC2. CURAND library 
[13]- 

For the lattice array we cannot use in CUDA a four dimensional array to store the 
lattice. Therefore we use a ID array with size N x x N y x N z x N t x Dim and a f loat4, 
in the case of single precision, or double4, for double precision, to store the generators of 
SU(2) (a , oi, a-2 and 03). Then, we need to construct all the CUDA operators to make 
all the operations needed. In this way, we only need four floating point numbers per link 
instead of having a 2 x 2 complex matrix. In order to select single or double precision, 
we use templates in the code. 

In the heat bath and over-relaxation methods, since we need to calculate the staple 
at each link direction and given the GPU architecture, we use the chessboard method, 
calculating the new links separately by direction and by even and odd sites. 

The Plaquette, Polyakov Loop and Wilson Loop kernels are used to calculate the 
plaquette, the Polyakov loop and the Wilson loop by lattice site. In the end we need to 
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perform the sum over all lattice sites. To make this sum, we use the parallel reduction 
code (kernel 6) in the NVIDIA GPU Computing SDK package [14, 15]. 

Although CUDA neither supports explicitly double textures nor supports double4 tex- 
tures, it is possible to bind a double4 array to a texture and then retrieve double4 values. 

This can be done by declaring the texture as int4 and then using hiloint2double to 

cast it to double, as in the following code example: 

texture<int4, 1, cudaReadModeElementType> tex_lattice_double ; 

device double4 f etch_lat (double4 *x, int i){ 

#if __CUDA_ARCH__ >= 130 

// double requires Compute Capability 1.3 or greater 

if (UseTex) 

{ 

int4 v = texlDf etch(tex_lattice_double , 2 * i) ; 

int4 u = texlDf etch(tex_lattice_double , 2 * i + 1) ; 

return make_double4( hiloint2double(v.y, v.x), 

hiloint2double(v.w, v.z), 

hiloint2double(u.y, u.x), 

hiloint2double(u.w, u.z)); 

} 

else 

return x [i] ; 

#else 

return x [i] ; 
#endif 
} 

moreover, float textures are declared and accessed as, 

texture<f loat4 , 1, cudaReadModeElementType> tex_lattice; 

device float4 f etch_lat (f loat4 *x, int i){ 

if (UseTex) 

return texlDf etch(tex_lattice , i) ; 

else 

return x [i] ; 

} 

We now address the multi-GPU approach. The Multi-GPU part was implemented 
using CUDA and OPENMP, each CPU thread controls one GPU. Each GPU computes 
N a x ^ . The total length of the array in each GPU is then N a x ( ^ h 2). 

u num. g pus ° J u v num. gpus n 

see Fig. 5. At each iteration, the links are calculated separately by even and odd lattice 
sites and by the direction [i. Before calculating the next direction, the border cells in 
each GPU need to be exchanged between each GPU. On the border of each lattice, at 
least one of the neighboring sites is located in the memory of another GPU, see Fig. 5b. 
For this reason, the links at the borders of each lattice have to be transferred from one 
GPU to the GPU handling the adjacent lattice. In order to exchange the border cells 
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Figure 5: Schematic view of the lattice array handled by each GPU. 



between GPUs it is necessary to copy these cells to CPU memory and then synchronize 
each CPU thread with the command #pragma omp barrier before updating the GPU 
memory, ghost cells. 



5. Results 

Here we present the benchmark results using two different GPU architectures (GT200 
and Fermi) in generating pure gauge lattice SU(2) configurations. We also compare the 
performance with two Fermi GPUs working in parallel in the same mother-board, using 
CUDA and OPENMP. 

Results for the mean average plaquette and Polyakov loop are also presented. Fi- 
nally, the static quark-antiquark potential is calculated on GPUs using single and double 
precision. We also present results with smeared and unsmeared configurations, as well 
as the results obtained for the lattice spacing with (3 = 2.8. In these results, we didn't 
use any step of over-relaxation. 

Our code can be downloaded from the Portuguese Lattice QCD collaboration home- 
page [16]. 



5.1. Performance of Monte Carlo Generator 

In this section, we compare the performance between GPU's, see table 2, (two dif- 
ferent architectures, NVIDIA GTX 295, GT200 architecture, and NVIDIA GTX 480, 
FERMI architecture) and a CPU (Intel Core i7 CPU 920, 2.67GHz, 8 MB L2 Cache and 
12 GB of RAM). We compare the performance in generating pure gauge lattice SU(2) 
configurations and measure the mean average plaquette for each iteration with /3 = 6.0, 
hot start initialization and 100 iterations in single and double precision. 

In Fig. 6, we present the performance results using NVIDIA GPUs, NVIDIA GTX 
295 (with 2 GPUs per board) and 2 NVIDIA GTX 480 (with 1 GPU per board) versus 
one CPU core. Our CPU code to generate a random SU(2) matrix and the heat bath 
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algorithm is simply the same code we developed for the GPU, except that the memory 
and process transfers to the GPU are different, as well as the process to sum an array 
(calculate the mean plaquette value). Moreover, our CPU code is not implemented with 
SSE instructions. In Fig. 7, we show the GPU performance in Gflops/s. We run the 
code for 100 iterations, starting with a random SU(2) configuration. In the heat bath 
algorithm, we only perform one try to update the link. In this way, we measure the 
flops in all kernels used (kernel to initialize the random SU(2) configuration, heat bath 
kernel, plaquette kernel and the parallel reduction kernel). Although the GPU peak 
performance is around one Tflops/s in single performance, the performance achieved by 
our code, around 70 Gflops/s using one Fermi GPU, is significantly affected by the large 
memory transfers, i.e., for each try to update one gauge link, we need to copy from global 
memory 19 links (19xfloat4(double4)) plus one unsigned int in the random array and to 
calculate the plaquette at each lattice site we need to copy 24 links (24xfloat4(double4)). 
Note that in the heat bath kernel we need to calculate new random numbers but this is 
not accounted in the number of flops, as well as we only count one instruction for log(), 
cos(), sin() and sqrtQ functions. The CPU (Intel Core i7 CPU 920, 2.67GHz, 8 MB L2 
Cache and 12 GB of RAM) performance in one core is almost constant as the lattice size 
increases, 510-520 Mflops/s. 

The memory access inside the GPUs was done using two methods, one using textures 
and the other one using the global memory in the NVIDIA GTX 295 case and the cache 
memory in NVIDIA GTX 480. We don't use the shared memory because it is a resource 
too small to fit in our problem. We only show the performance tests for a maximum 
lattice array that can fit in our GPU memory. Using only one Fermi GPU, the maximum 
lattice array size in the GPU memory is 66 4 and 56 4 for single and double precision, 
respectively. 

In the Fermi architecture there is not much difference between using textures or 
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Figure 6: Performance results. 295 - NVIDIA Geforce 295 GTX; 480 - NVIDIA Geforce 
480 GTX; (1) - with 1 GPU; (2) - with 2 CPUs; Tex - using textures; GM - using global 
memory. 



accessing to global memory when using single precision. This is because of the new 
cache hierarchy (LI and L2 cache). In architectures prior to Fermi, there is no cache 
hierarchy, therefore, when using textures on these architectures, we can achieve a higher 
performance in comparison to accessing to the Global memory. However, when using 
textures there is a limitation of the array size, the maximum width for a ID texture 
reference bound to linear memory is 2 27 , independent of the GPU architecture. 

Splitting the lattice array in four, i.e., one array for each link direction, we can achieve 
1.4x the speed over using only one single array to store all the lattice. However, using 
four arrays makes it harder to add new code, since it forces us to write the code more 
explicitly and the programming errors are more difficult to find. Thus we prefer to use 
a single array. 

5.2. Plaquette 

The measurement of the average plaquette is defined as the average trace of each 
plaquette, as defined in Eq. (12), in all configurations and is the simplest measurement 
that can be done in the lattice. In Fig. 8, we present the results for the mean average 
plaquette, as well as the analytic predictions, for different j3, with 10 000 configurations 
and 32 4 lattice size. 

We are able to perform, at least 3 million Monte Carlo steps per day and calculate 
the mean average plaquette, in the case of a 32 4 lattice using the two Fermi GPUs. For 
a 64 4 lattice size, we perform 250 000 iterations per day. 

5.3. Polyakov Loop 

We now test the GPU performance measuring the Polyakov loop at each generated 
lattice SU(2), in the same conditions made in the performance tests of Subsection 5.2 . 
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20 30 40 50 
n, Lattice size (n*) 

(a) Single precision. 



20 30 40 50 
n, Lattice size (n 4 ) 

(b) Double precision. 



Figure 7: Performance results in Gflops/s for 100 iterations starting with a random 
SU(2) matrix and one try to update a gauge link. 295 - NVIDIA Geforce 295 GTX; 480 
- NVIDIA Geforce 480 GTX; (1) - with 1 GPU; (2) - with 2 CPUs; Tex - using textures; 
GM - using global memory. Notice that our code, in one CPU core has a performance 
of 510 Mflops/s to 520 Mflops/s. 
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Figure 8: Mean average plaquette for 32 4 lattice size (data points) and analytic predic- 
tions (denoted by dashed lines). 
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Figure 9: j3 dependence of the mean average Polyakov loop from Monte Carlo simulation. 



The performance is almost the same, 1.1 x, compared with only measuring the average 
plaquette. Fig. 9 shows the expectation value of the Polyakov loop as a function of (3 = 
4/<?o (5o is the coupling constant), for several lattice sizes and using 10 000 configurations. 
The confinement is evident at high couplings, while the deconfinement occurs at small 
couplings, i.e., the Polyakov loop is zero at high couplings and then at certain critical 
coupling value it rises to a finite value. As can be seen, the shape of the curve depends 
on the temporal size, related to the temperature T of the lattice, when the spatial size 
is kept fixed at N a = 48 3 . 

5.4- The static quark- antiquark potential 

The static quark-antiquark potential, i. e. the potential between two infinitely heavy 
quarks, has the following long distance expansion, 



aV(aR) = A a 



D 
R 



a a R , 



(22) 



where V(R) is the static quark-antiquark potential, a is the lattice spacing, A is a constant 
term, B is the coefficient to the Coulomb term, a is the string tension and R is the 
distance in lattice units. The extraction of the signal of the static quark potential from 
thermalized lattice gauge configurations is given by, the effective mass plot, 



V{R) = In 



(W(R,T)) 
(W(R,T + 1)) 



(23) 



since 



(W(R,T)) =e- L V ^ H > . (24) 

In Fig. 10, we show the fit results for the static quark-antiquark potential using two 
GPU architectures (GT200 and Fermi). Results in single precision from both architec- 
tures are presented, as well as the results from double precison from Fermi architecture. 
All these results agree within our error bars. 

In Fig. 11, we show the results for the static quark-antiquark potential with j3 — 2.8 
and 24 3 x 48 lattice size, using the Fermi GPU. Importantly, we show our results obtained 
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Figure 10: Fit to the static quark-antiquark potential (in lattice units) for 1981 24 3 x 32 
configurations with j3 = 2.5 and with APE smearing. Comparison between two different 
architectures (GT200 and Fermi) in single precision and in double precision for the Fermi 
architecture. 



with APE smearing, and without no smearing at all. In Table 3, we show the values 
obtained for the lattice spacing a as well as the number of configurations used. The 
lattice spacing, a, was calculated using the relation C = a a 2 , where C is the value 
obtained from the linear part of the fit and a the physical value for the string tension, 
V^ = 440MeV, i.e., 

/-197MeV 

a = ^440MeV (fm) ' (25) 
Importantly, utilizing the computational power of the GPUs, we can now afford to 
calculate the static quark-antiquark potential using thousands of configurations, to study 
whether the results obtained with and without smearing are in agreement. Note that 
the quark-antiquark potential has already been extensively studied [17, 18, 19, 20, 21], 
either for small interquark distances or using different smearing techniques, like the 
APE smearing, but usually fail to pick up a significant signal for long distances with no 
smearing. The APE smearing, or other smearing method, have the property to enhance 
the ground state and therefore decouple it from excitations effectively, since the ground 
state wave function is always the smoothest wave function within any given channel. The 
use of APE smearing is an important tool in order to obtain a clear plateau in Eq. (23). 

In Table 3 for (3 = 2.8 and in Fig. 11 we compare our results with and without 
smearing. Although, the unsmeared configurations have larger contribution from the ex- 
cited states, we can extract the static potential, noting that the number of configurations 
needed to obtain a good signal are indeed quite large. We confirm that smearing, or at 
least APE smearing, get a potential consistent within error bars to the one produced by 
unsmeared configurations. 
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Figure 11: Fit to the static quark-antiquark potential (in lattice units), with and without 
applying smearing with j3 = 2.8 and 24 3 x 48. 



p 


era 2 


a (fm) 


Lattice size 


APE Smearing 


# of config. 


2.5 


0.036623(625) 


0.085682(731) 


24 3 x 32 


w = 0.2, n = 25 


1981 


2.8 


0.006805(313) 


0.036933(850) 


24 3 x 48 


none 


52712 


2.8 


0.006564(75) 


0.036275(207) 


24 3 x 48 


w = 0.2, n = 25 


1981 



Table 3: Lattice spacing results. 
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6. Conclusion 

The use of GPUs can improve dramatically the speed of pure gauge SU(2) lattice 
computations. Using 2 NVIDIA Geforce 480 GTX GPUs in a desktop computer, we 
achieve 200x the computation speed over one CPU core, in single precision, around 110 
Gflops/s using two Fermi GPUs. We obtain excellent benchmarks because our compu- 
tation is integrally performed in the GPU. Our code can be downloaded from the site of 
the Portuguese Lattice QCD collaboration [16]. 

The use of textures can increase the speed of memory access when memory access 
patterns are very complicated and the shared memory cannot be used, although the 
maximum array size, when using textures, is limited. Taking advantage of the cache 
hierarchy introduced in the last architecture, allowed to have similar performance results 
when accessing to the memory and without having limitations in the array size. 

When using multiple GPUs we can improve the speed, making the overlap between 
computation and data transfers, however this was not yet implemented in the code. In 
the future, we will implement this using cudaMemcpyAsync () and streams. We have 
used cudaMemcpy ( ) to perform the data transfers. When this function is used, the 
control is returned to the host thread only after the data transfer is complete. With 
cudaMemcpyAsync (), the control is returned immediately to the host thread. The asyn- 
chronous transfer version requires pinned host memory and an additional argument, a 
stream ID. A stream is simply a sequence of sorted in time operations, performed in 
order on the GPU. Therefore, operations in different streams can be interleaved and in 
some cases overlapped, a property that can be used to hide data transfers between the 
host (CPU) and the device (GPU). 

We exploit our computational power to compute benchmarks for the Monte Carlo 
generation of SU(2) lattice gauge configurations, for the plaquette and Polyakov loop 
expectation values, and for the static quark-antiquark potential with Wilson loops. We 
are able to verify, utilizing a very large number of configurations, that the APE smearing 
does not distort the static quark-antiquark potential. 
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